Abstract

Verbal comprehension questions appear very frequently in Intelligence Quotient (IQ) tests, which measure human verbal ability, including the understanding of words with multiple senses, synonyms and antonyms, and analogies among words. In this work, we explore whether such tests can be solved automatically by deep learning technologies for text data. We found that the task was quite challenging, and that simply applying existing technologies such as word embedding could not achieve good performance, due to the multiple senses of words and the complex relations among words. To tackle these challenges, we propose a novel framework that automatically solves verbal IQ questions by leveraging an improved word embedding that jointly considers the multi-sense nature of words and the relational information among words. Experimental results show that the proposed framework not only outperforms existing methods for solving verbal comprehension questions but also exceeds the average performance of the Amazon Mechanical Turk workers involved in the study.

1 Introduction 

The Intelligence Quotient (IQ) test [Stern, 1914] is a test of intelligence designed to formally study the success of an individual in adapting to a specific situation under certain conditions. Common IQ tests measure various types of abilities such as verbal, mathematical, logical, and reasoning skills. These tests have been widely used in the study of psychology, education, and career development. In the community of artificial intelligence, agents have been invented to fulfill many interesting and challenging tasks like face recognition, speech recognition, handwriting recognition, and question answering. However, as far as we know, there are very limited studies on developing an agent to solve IQ tests, which in some sense is more challenging, since even human beings do not always succeed in the tests. Considering that IQ test scores have been widely considered as a measure of intelligence, we think it is worth investigating further whether we can develop an agent that can solve IQ tests.


The commonly used IQ tests contain several types of questions like verbal, mathematical, logical, and picture questions, among which a large proportion (nearly 40%) are verbal questions [Carter, 2005]. The recent progress on deep learning for natural language processing (NLP), such as word embedding technologies, has advanced the ability of machines (or AI agents) to understand the meaning of words and the relations among words. This inspires us to solve the verbal questions in IQ tests by leveraging word embedding technologies. However, our attempts show that a straightforward application of word embedding does not yield satisfactory performance. This is understandable. Standard word embedding technologies learn one embedding vector for each word based on the co-occurrence information in a text corpus. However, verbal comprehension questions in IQ tests usually consider the multiple senses of a word (and often focus on the rare senses), and the complex relations among (polysemous) words. This clearly exceeds the capability of standard word embedding technologies.

To tackle the aforementioned challenges, we propose a 
novel framework that consists of three components. 

First, we build a classifier to recognize the specific type (e.g., analogy, classification, synonym, and antonym) of verbal questions. For different types of questions, different kinds of relationships need to be considered and the solvers could have different forms. Therefore, with an effective question type classifier, we may solve the questions in a divide-and-conquer manner.

Second, we obtain distributed representations of words and relations by leveraging a novel word embedding method that considers the multi-sense nature of words and the relational knowledge among words (or their senses) contained in dictionaries. In particular, for each polysemous word, we retrieve its number of senses from a dictionary, and conduct clustering on all its context windows in the corpus. Then we attach the example sentences for every sense in the dictionary to the clusters, such that we can tag the polysemous word in each context window with a specific word sense. On top of this, instead of learning one embedding vector for each word, we learn one vector for each word-sense pair. Furthermore, in addition to learning the embedding vectors for words, we also learn the embedding vectors for relations (e.g., synonym and antonym) at the same time, by incorporating relational knowledge into the objective function of the word embedding learning algorithm. That is, the learning of word-sense representations and relation representations interacts with each other, such that the relational knowledge obtained from dictionaries is effectively incorporated.

Third, for each type of question, we propose a specific solver based on the obtained distributed word-sense representations and relation representations. For example, for analogy questions, we find the answer by minimizing the distance between word-sense pairs in the question and the word-sense pairs in the candidate answers.

We have conducted experiments using a combined IQ test set to test the performance of our proposed framework. The experimental results show that our method can outperform several baseline methods for verbal comprehension questions in IQ tests. We further delivered the questions in the test set to human beings through Amazon Mechanical Turk^1. The average performance of the human beings is even a little lower than that of our proposed method.

2 Related Work 

2.1 Verbal Questions in IQ Test 

In common IQ tests, a large proportion of questions are verbal comprehension questions, which play an important role in deciding the final IQ scores. For example, in the Wechsler Adult Intelligence Scale [Wechsler, 2008], which is among the most famous IQ test systems, the full-scale IQ is calculated from two IQ scores: Verbal IQ and Performance IQ, and around 40% of the questions in a typical test are verbal comprehension questions. Verbal questions can test not only the verbal ability (e.g., understanding the polysemy of a word), but also the reasoning ability and induction ability of an individual. According to previous studies [Carter, 2005], verbal questions mainly have the types elaborated in Table 1, in which the correct answers are highlighted in bold font.

Analogy-I questions usually take the form “A is to B as C is to ?”. One needs to choose a word D from a given list of candidate words to form an analogical relation between pair (A, B) and pair (C, D). Such questions test the ability to identify an implicit relation from word pair (A, B) and apply it to compose word pair (C, D). Note that the Analogy-I questions are also used as a major evaluation task in the word2vec models [Mikolov et al., 2013]. Analogy-II questions require two words to be identified from two given lists in order to form an analogical relation like “A is to ? as C is to ?”. Such questions are a bit more difficult than the Analogy-I questions since the analogical relation cannot be observed directly from the questions, but needs to be searched in the word pair combinations from the candidate answers. Classification questions require one to identify the word that is different (or dissimilar) from others in a given word list. Such questions are also known as odd-one-out, and have been studied in [Pinter et al., 2012]. Classification questions test the ability to summarize the majority sense of the words and identify the outlier. Synonym questions require one to pick one word out of a list of words such that it has the closest meaning to a given word. Synonym questions test the ability to identify all senses of the candidate words and select the correct sense that can form a synonymous relation to the given word. Antonym questions require one to pick one word out of a list of words such that it has the opposite meaning to a given word. Antonym questions test the ability to identify all senses of the candidate words and select the correct sense that can form an antonymous relation to the given word.

^1 http://www.mturk.com/


Although there are some efforts to solve mathematical, logical, and picture questions in IQ tests [Sanghi and Dowe, 2003; Strannegard et al., 2012; Kushman et al., 2014; Seo et al., 2014; Hosseini et al., 2014; Weston et al., 2015], there have been very few efforts to develop automatic methods to solve verbal questions.


2.2 Deep Learning for Text Mining 

Building distributed word representations [Bengio et al., 2003], a.k.a. word embeddings, has attracted increasing attention in the area of machine learning. Different from conventional one-hot representations of words or distributional word representations based on a co-occurrence matrix between words, such as LSA [Dumais et al., 1988] and LDA [Blei et al., 2003], distributed word representations are usually low-dimensional dense vectors trained with neural networks by maximizing the likelihood of a text corpus. Recently, to show their effectiveness in a variety of text mining tasks, a series of works applied deep learning techniques to learn high-quality word representations [Collobert and Weston, 2008; Mikolov et al., 2013; Pennington et al., 2014].

Nevertheless, since the above works learn word representations mainly based on word co-occurrence information, it is quite difficult to obtain high-quality embeddings for words with very little context information; on the other hand, a large amount of noisy or biased context could also give rise to ineffective word embeddings. Therefore, it is necessary to introduce extra knowledge into the learning process to regularize the quality of word embedding. Some efforts have learned word embeddings in order to address knowledge base completion and enhancement [Bordes et al., 2011; Socher et al., 2013; Weston et al., 2013a], and some other efforts have tried to leverage knowledge to enhance word representations [Luong et al., 2013; Weston et al., 2013b; Fried and Duh, 2014; Celikyilmaz et al., 2015]. Moreover, all the above models assume that one word has only one embedding no matter whether the word is polysemous or monosemous, which might bring some confusion for the polysemous words. To solve this problem, there are several efforts like [Huang et al.; Tian et al., 2014; Neelakantan et al., 2014]. However, these models do not leverage any extra knowledge (e.g., relational knowledge) to enhance word representations.


3 Solving Verbal Questions 

In this section, we introduce our proposed framework to solve 
the verbal questions, which consists of the following three 
components. 




















































Table 1: Types of verbal questions.

| Type | Example |
|---|---|
| Analogy-I | Isotherm is to temperature as isobar is to? (i) atmosphere, (ii) wind, (iii) pressure, (iv) latitude, (v) current. |
| Analogy-II | Identify two words (one from each set of brackets) that form a connection (analogy) when paired with the words in capitals: CHAPTER (book, verse, read), ACT (stage, audience, play). |
| Classification | Which is the odd one out? (i) calm, (ii) quiet, (iii) relaxed, (iv) serene, (v) unruffled. |
| Synonym | Which word is closest to IRRATIONAL? (i) intransigent, (ii) irredeemable, (iii) unsafe, (iv) lost, (v) nonsensical. |
| Antonym | Which word is most opposite to MUSICAL? (i) discordant, (ii) loud, (iii) lyrical, (iv) verbal, (v) euphonious. |


3.1 Classification of Question Types 

The first component of the framework is a question classifier, which identifies different types of verbal questions. Since different types of questions usually have their unique ways of expression, the classification task is relatively easy, and we therefore take a simple approach to fulfill the task. Specifically, we regard each verbal question as a short document and use TF-IDF features to build its representation. Then we train an SVM classifier with a linear kernel on a portion of labeled question data, and apply it to other questions. The question labels include Analogy-I, Analogy-II, Classification, Synonym, and Antonym. We use the one-vs-rest training strategy to obtain a linear SVM classifier for each question type.
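As a rough illustration of the feature step only, the sketch below computes TF-IDF weights for tokenized questions using the Python standard library. The tokenization, the exact TF and IDF variants, and the function name `tfidf_vectors` are assumptions made for this example; the text does not specify them, and the linear SVM itself is not shown.

```python
import math
from collections import Counter
from typing import Dict, List


def tfidf_vectors(docs: List[List[str]]) -> List[Dict[str, float]]:
    """Return one sparse TF-IDF vector (term -> weight) per tokenized question.

    Assumed variants: tf = count / question length, idf = log(N / df).
    """
    n_docs = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # document frequency counts each term once per question
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({
            term: (c / len(doc)) * math.log(n_docs / df[term])
            for term, c in counts.items()
        })
    return vectors


# Two toy questions; each one is treated as a short document.
questions = [
    "which is the odd one out".split(),
    "which word is closest to irrational".split(),
]
features = tfidf_vectors(questions)
```

Terms shared by every question (such as "which" here) receive weight zero, so the type-specific phrasing dominates the representation, which is the property the classifier relies on.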

3.2 Embedding of Word-Senses and Relations 

The second component of our framework leverages deep learning technologies to learn distributed representations for words (i.e. word embedding). Note that in the context of verbal question answering, we have some specific requirements on this learning process. Verbal questions in IQ tests usually consider the multiple senses of a word (and focus on the rare senses), and the complex relations among (polysemous) words, such as the synonym and antonym relations. These challenges exceed the capability of standard word embedding technologies. To address this problem, we propose a novel approach that considers the multi-sense nature of words and integrates the relational knowledge among words (or their senses) into the learning process. In particular, our approach consists of two steps. The first step aims at labeling a word in the text corpus with its specific sense, and the second step employs both the labeled text corpus and the relational knowledge contained in dictionaries to simultaneously learn embeddings for both word-sense pairs and relations.

Multi-Sense Identification 

First, we learn a single-sense word embedding by using the skip-gram method in word2vec [Mikolov et al., 2013].

Second, we gather the context windows of all occurrences of a word used in the skip-gram model, and represent each context by a weighted average of the pre-learned embedding vectors of the context words. We use TF-IDF to define the weighting function, where we regard each context window of the word as a short document to calculate the document frequency. Specifically, for a word $w_0$, each of its context windows can be denoted by $(w_{-N}, \cdots, w_0, \cdots, w_N)$. Then we represent the window by calculating the weighted average of the pre-learned embedding vectors of the context words as below,

$$\xi = \frac{1}{2N} \sum_{i=-N,\, i \neq 0}^{N} g_{w_i} v_{w_i}, \qquad (1)$$

where $g_{w_i}$ is the TF-IDF score of $w_i$, and $v_{w_i}$ is the pre-learned embedding vector of $w_i$. After that, for each word, we use spherical $k$-means to cluster all its context representations, where the cluster number $k$ is set as the number of senses of this word in the online dictionary.
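The weighted average of Equation (1) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `context_vector`, the dictionary-based inputs, and the choice to let out-of-vocabulary words contribute zero are assumptions; the clustering step is not shown.

```python
from typing import Dict, List, Sequence


def context_vector(context_words: Sequence[str],
                   tfidf: Dict[str, float],
                   emb: Dict[str, List[float]]) -> List[float]:
    """TF-IDF-weighted average of the context-word embeddings of one window.

    `context_words` holds the 2N words around the target word (target excluded).
    """
    dim = len(next(iter(emb.values())))
    total = [0.0] * dim
    for word in context_words:
        if word not in emb:
            continue  # assumption: unknown words contribute nothing
        weight = tfidf.get(word, 0.0)
        for d, value in enumerate(emb[word]):
            total[d] += weight * value
    n = len(context_words)
    return [value / n for value in total] if n else total


# Toy 2-dimensional embeddings and TF-IDF scores (illustrative values only).
toy_emb = {"river": [1.0, 0.0], "money": [0.0, 1.0]}
toy_tfidf = {"river": 2.0, "money": 4.0}
xi = context_vector(["river", "money"], toy_tfidf, toy_emb)
```

Each resulting vector is one point to be clustered; words with high TF-IDF scores pull the window representation toward their own embedding.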

Third, we match each cluster to the corresponding sense in the dictionary. On one hand, we represent each cluster by the average embedding vector of all the context windows included in the cluster. For example, suppose word $w_0$ has $k$ senses and thus it has $k$ clusters of context windows; we denote the average embedding vectors for these clusters as $\xi_1, \cdots, \xi_k$. On the other hand, since the online dictionary uses some descriptions and example sentences to interpret each word sense, we can represent each word sense by the average embedding of the words in its description and in the corresponding example sentences. Here, we assume the representation vectors (based on the online dictionary) for the $k$ senses of $w_0$ are $\zeta_1, \cdots, \zeta_k$. After that, we consecutively match each cluster to its closest word sense in terms of the distance computed in the word embedding space, i.e.,

$$(\xi_{i'}, \zeta_{j'}) = \operatorname*{argmin}_{i, j = 1, \cdots, k} d(\xi_i, \zeta_j), \qquad (2)$$

where $d(\cdot, \cdot)$ calculates the Euclidean distance and $(\xi_{i'}, \zeta_{j'})$ is the first matched pair of window cluster and word sense. Here, we simply take a greedy strategy. That is, we remove $\xi_{i'}$ and $\zeta_{j'}$ from the cluster vector set and the sense vector set, and recursively run (2) to find the next matched pair until all the pairs are found. Finally, each word occurrence in the corpus is relabeled by its associated word sense, which will be used to learn the embeddings for word-sense pairs in the next step.
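The greedy matching described above can be sketched in a few lines. The function names and the list-of-lists vector format are assumptions for illustration; ties are broken by index order here, which the text does not specify.

```python
import math
from typing import List, Sequence, Tuple


def euclidean(u: Sequence[float], v: Sequence[float]) -> float:
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def greedy_match(clusters: List[Sequence[float]],
                 senses: List[Sequence[float]]) -> List[Tuple[int, int]]:
    """Repeatedly pair the closest remaining (cluster, sense) and remove both."""
    free_clusters = set(range(len(clusters)))
    free_senses = set(range(len(senses)))
    pairs = []
    while free_clusters and free_senses:
        i, j = min(
            ((i, j) for i in sorted(free_clusters) for j in sorted(free_senses)),
            key=lambda p: euclidean(clusters[p[0]], senses[p[1]]),
        )
        pairs.append((i, j))
        free_clusters.remove(i)
        free_senses.remove(j)
    return pairs


# Toy vectors: cluster 0 lies near sense 1, cluster 1 lies near sense 0.
toy_clusters = [[0.0, 0.0], [10.0, 10.0]]
toy_senses = [[9.0, 9.0], [0.5, 0.5]]
matching = greedy_match(toy_clusters, toy_senses)
```

Because the strategy is greedy, the result is a one-to-one assignment but not necessarily the one with the smallest total distance.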

Co-Learning Word-Sense Pair Representations and Relation Representations

After relabeling the text corpus, different occurrences of a polysemous word may correspond to its different senses, or more accurately word-sense pairs. We then learn the embeddings for word-sense pairs and relations (obtained from dictionaries, such as synonym and antonym) simultaneously, by integrating relational knowledge into the objective function of a word embedding learning model like skip-gram. We propose to use a function $E_r$ as described below to capture the relational knowledge.

Specifically, the existing relational knowledge extracted from dictionaries, such as synonym, antonym, etc., can be naturally represented in the form of a triplet (head, relation, tail) (denoted by $(h_i, r, t_j) \in S$, where $S$ is the set of relational knowledge), which consists of two word-sense pairs (i.e. word $h$ with its $i$-th sense and word $t$ with its $j$-th sense), $h, t \in W$ ($W$ is the set of words), and a relationship $r \in R$ ($R$ is the set of relationships). To learn the relation representations, we make the assumption that relationships between words can be interpreted as translation operations and that they can be represented by vectors. The principle in this model is that if the relationship $(h_i, r, t_j)$ exists, the representation of the word-sense pair $t_j$ should be close to that of $h_i$ plus the representation vector of the relationship $r$, i.e. $h_i + r$; otherwise, $h_i + r$ should be far away from $t_j$. Note that this model learns word-sense pair representations and relation representations in a unified continuous embedding space.

According to the above principle, we define $E_r$ as a margin-based regularization function over the set of relational knowledge $S$,

$$E_r = \sum_{(h_i, r, t_j) \in S} \; \sum_{(h', r, t') \in S'_{(h_i, r, t_j)}} \left[ \gamma + d(h_i + r, t_j) - d(h' + r, t') \right]_+ . \qquad (3)$$

Here $[X]_+$ denotes the positive part of $X$, $\gamma > 0$ is a margin hyperparameter, and $d(\cdot, \cdot)$ is the distance measure for the words in the embedding space. For simplicity, we again define $d(\cdot, \cdot)$ as the Euclidean distance. The set of corrupted triplets $S'_{(h_i, r, t_j)}$ is defined as:

$$S'_{(h_i, r, t_j)} = \{(h', r, t_j)\} \cup \{(h_i, r, t')\},$$

which is constructed from $S$ by replacing either the head word-sense pair or the tail word-sense pair by another randomly selected word with its randomly selected sense.
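To make the margin term concrete, the sketch below evaluates the hinge loss for one true triplet and one corrupted triplet. The helper names, the plain-list vectors, and the default margin value of 1.0 are assumptions for illustration only; summing this quantity over all triplets and their corruptions would give the regularizer.

```python
import math
from typing import Sequence


def euclidean(u: Sequence[float], v: Sequence[float]) -> float:
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def translate(head: Sequence[float], relation: Sequence[float]) -> list:
    """Apply the relation as a translation: head + relation."""
    return [a + b for a, b in zip(head, relation)]


def margin_term(h, r, t, h_neg, t_neg, gamma: float = 1.0) -> float:
    """[gamma + d(h + r, t) - d(h' + r, t')]_+ for one corrupted triplet."""
    positive = euclidean(translate(h, r), t)
    negative = euclidean(translate(h_neg, r), t_neg)
    return max(0.0, gamma + positive - negative)


# Toy 2-d example: the true tail sits exactly at head + relation.
h, r, t = [0.0, 0.0], [1.0, 0.0], [1.0, 0.0]
far_loss = margin_term(h, r, t, h, [1.0, 3.0])   # corrupted tail is far away
near_loss = margin_term(h, r, t, h, [1.0, 0.5])  # corrupted tail is too close
```

The term is zero once the corrupted triplet is at least `gamma` farther from the translated head than the true tail, and positive otherwise.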

Note that the optimization process might trivially minimize $E_r$ by simply increasing the norms of word-sense pair representations and relation representations. To avoid this problem, we use an additional constraint on the norms, which is a commonly used trick in the literature [Bordes et al., 2011]. However, instead of enforcing the $L_2$-norm of the representations to be 1 as in [Bordes et al., 2011], we adopt a soft norm constraint on the relation representations as below:

$$r_i = 2\sigma(x_i) - 1, \qquad (4)$$

where $\sigma(\cdot)$ is the sigmoid function $\sigma(x_i) = 1/(1 + e^{-x_i})$, $r_i$ is the $i$-th dimension of relation vector $r$, and $x_i$ is a latent variable, which guarantees that every dimension of the relation representation vector is within the range $(-1, 1)$.
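A small numerical sketch of this soft constraint follows. It uses the identity $2\sigma(x) - 1 = \tanh(x/2)$, which avoids overflow in the exponential; the function name `relation_from_latent` is an assumption for illustration.

```python
import math
from typing import List, Sequence


def relation_from_latent(x: Sequence[float]) -> List[float]:
    """Map unconstrained latent values x_i to r_i = 2*sigmoid(x_i) - 1.

    Computed as tanh(x_i / 2), which is algebraically the same quantity.
    """
    return [math.tanh(xi / 2.0) for xi in x]


latent = [-3.0, 0.0, 3.0]
relation = relation_from_latent(latent)
# Direct evaluation of 2*sigmoid(x) - 1, for comparison with the tanh form.
direct = [2.0 / (1.0 + math.exp(-xi)) - 1.0 for xi in latent]
```

Optimization can then update the latent values freely while every component of the relation vector stays bounded.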

By combining the skip-gram objective function and the regularization function derived from relational knowledge, we get the following combined objective $J_r$ that incorporates relational knowledge into the word-sense pair embedding calculation process,

$$J_r = \alpha E_r - L, \qquad (5)$$

where $L$ is the skip-gram objective and $\alpha$ is the combination coefficient. Our goal is to minimize the combined objective $J_r$, which can be optimized using back-propagation neural networks. By using this model, we can obtain the distributed representations for both word-sense pairs and relations simultaneously.

3.3 Solvers for Each Type of Questions 
Analogy-I 

For the Analogy-I questions like “A is to B as C is to ?”, we answer them by optimizing:

$$D = \operatorname*{argmax}_{i_b, i_a, i_c, i_{d'};\, D' \in T} \cos\left(v_{(B, i_b)} - v_{(A, i_a)} + v_{(C, i_c)},\; v_{(D', i_{d'})}\right), \qquad (6)$$

where $T$ contains all the candidate answers, $\cos$ means cosine similarity, and $i_b, i_a, i_c, i_{d'}$ are the indexes for the word senses of $B, A, C, D'$ respectively. Finally $D$ is selected as the answer.
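A minimal sketch of this search over candidates and sense combinations is given below. The `senses` dictionary (word to list of sense vectors), the function names, and the toy vectors are assumptions for illustration; real sense vectors would come from the trained embedding.

```python
import math
from itertools import product
from typing import Dict, List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def solve_analogy_1(a: str, b: str, c: str, candidates: List[str],
                    senses: Dict[str, List[List[float]]]) -> str:
    """Pick the candidate D' maximizing cos(v_B - v_A + v_C, v_D') over all senses."""
    best_word, best_score = None, -math.inf
    for va, vb, vc in product(senses[a], senses[b], senses[c]):
        target = [y - x + z for x, y, z in zip(va, vb, vc)]
        for cand in candidates:
            for vd in senses[cand]:
                score = cosine(target, vd)
                if score > best_score:
                    best_word, best_score = cand, score
    return best_word


# Toy sense vectors (illustrative values only).
toy_senses = {
    "A": [[1.0, 0.0]], "B": [[1.0, 1.0]], "C": [[2.0, 0.0]],
    "x": [[2.0, 1.0]], "y": [[-1.0, 0.0]], "z": [[0.0, -1.0], [1.0, -2.0]],
}
answer = solve_analogy_1("A", "B", "C", ["x", "y", "z"], toy_senses)
```

The exhaustive loop is affordable because both the candidate list and the number of senses per word are small.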

Analogy-II 

As the form of the Analogy-II questions is like “A is to ? as C is to ?” with two lists of candidate answers, we can apply an optimization method as below to select the best $(B, D)$ pair,

$$\operatorname*{argmax}_{i_a, i_{b'}, i_c, i_{d'};\, B' \in T_1,\, D' \in T_2} \cos\left(v_{(B', i_{b'})} - v_{(A, i_a)} + v_{(C, i_c)},\; v_{(D', i_{d'})}\right), \qquad (7)$$

where $T_1, T_2$ are two lists of candidate words. Thus we get the answers $B$ and $D$ that can form an analogical relation between word pair $(A, B)$ and word pair $(C, D)$ under a certain specific word sense combination.
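The pair search can be sketched in the same style, now looping over both candidate lists. As before, the `senses` dictionary, the function names, and the toy vectors (chosen so that the pair "book"/"play" wins) are assumptions for illustration.

```python
import math
from itertools import product
from typing import Dict, List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def solve_analogy_2(a: str, c: str, list_b: List[str], list_d: List[str],
                    senses: Dict[str, List[List[float]]]) -> Tuple[str, str]:
    """Pick (B', D') maximizing cos(v_B' - v_A + v_C, v_D') over all senses."""
    best_pair, best_score = None, -math.inf
    for b, d in product(list_b, list_d):
        for va, vb, vc, vd in product(senses[a], senses[b], senses[c], senses[d]):
            target = [y - x + z for x, y, z in zip(va, vb, vc)]
            score = cosine(target, vd)
            if score > best_score:
                best_pair, best_score = (b, d), score
    return best_pair


# Toy sense vectors (illustrative values only).
toy_senses = {
    "chapter": [[1.0, 0.0]], "act": [[0.0, 1.0]],
    "book": [[2.0, 0.0]], "verse": [[0.0, -1.0]],
    "play": [[1.0, 1.0]], "stage": [[-1.0, -1.0]],
}
pair = solve_analogy_2("chapter", "act", ["book", "verse"], ["play", "stage"], toy_senses)
```

Compared with Analogy-I, the search space grows by the size of the first candidate list, which is why these questions are harder for the solver as well as for people.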

Classification 

For the Classification questions, we leverage the property that 
words with similar co-occurrence information are distributed 
close to each other in the embedding space. As there is one 
word in the list that does not belong to others, it does not 
have similar co-occurrence information with other words in 
the training corpus, and thus this word should be far away 
from other words in the word embedding space. 

According to the above discussion, we first calculate a group of mean vectors $m_{i_{w_1}, \cdots, i_{w_N}}$ of all the candidate words with any possible word senses as below,

$$m_{i_{w_1}, \cdots, i_{w_N}} = \frac{1}{N} \sum_{w_j \in T} v_{(w_j, i_{w_j})}, \qquad (8)$$

where $T$ is the set of candidate words, $N$ is the capacity of $T$, $w_j$ is a word in $T$; $i_{w_j}$ ($j = 1, \cdots, N$; $i_{w_j} = 1, \cdots, k_{w_j}$) is the index for the word senses of $w_j$, and $k_{w_j}$ ($j = 1, \cdots, N$) is the number of word senses of $w_j$. Therefore, the number of the mean vectors is $M = \prod_{j=1}^{N} k_{w_j}$. As both $N$ and $k_{w_j}$ are very small, the computation cost is acceptable. Then, we choose as the answer the word whose closest sense to the mean vectors is the farthest among the candidate words, i.e.,

$$w = \operatorname*{argmax}_{w_j \in T} \; \min_{i_{w_j};\, l = 1, \cdots, M} d\left(v_{(w_j, i_{w_j})}, m_l\right), \qquad (9)$$

where $m_l$ ($l = 1, \cdots, M$) enumerates the mean vectors.
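The odd-one-out rule can be sketched as follows: build one mean vector per combination of word senses, then return the word whose best sense is still farthest from every mean vector. The function names, the `senses` dictionary, and the toy vectors are assumptions for illustration.

```python
import math
from itertools import product
from typing import Dict, List, Sequence


def euclidean(u: Sequence[float], v: Sequence[float]) -> float:
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def odd_one_out(words: List[str], senses: Dict[str, List[List[float]]]) -> str:
    """Return the word whose closest (sense, mean vector) distance is largest."""
    n = len(words)
    means = []
    # One mean vector per combination of senses, i.e. M = product of sense counts.
    for combo in product(*(senses[w] for w in words)):
        dim = len(combo[0])
        means.append([sum(vec[d] for vec in combo) / n for d in range(dim)])

    def closest_distance(word: str) -> float:
        return min(euclidean(vec, m) for vec in senses[word] for m in means)

    return max(words, key=closest_distance)


# Toy sense vectors: "d" is far from the other three; "c" has two senses.
toy_senses = {
    "a": [[0.0, 0.0]],
    "b": [[0.0, 1.0]],
    "c": [[1.0, 0.0], [9.0, 9.0]],
    "d": [[10.0, 10.0]],
}
outlier = odd_one_out(["a", "b", "c", "d"], toy_senses)
```

Taking the minimum over a word's senses gives each polysemous word the benefit of its best-fitting sense before the outlier is chosen.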

Synonym 

For the Synonym questions, we empirically explored two solvers. For the first solver, we also leverage the property that words with similar co-occurrence information are located closely in the word embedding space. Therefore, given the question word $w_q$ and the candidate words $w_j$, we can find the answer by solving the following optimization problem,

$$w = \operatorname*{argmin}_{i_{w_q}, i_{w_j};\, w_j \in T} d\left(v_{(w_j, i_{w_j})}, v_{(w_q, i_{w_q})}\right), \qquad (10)$$

where $T$ is the set of candidate words. The second solver is based on the minimization objective of the translation distance between entities in the relational knowledge model (3). Specifically, we calculate the offset vector between the embedding of question word $w_q$ and each word $w_j$ in the candidate list. Then, we set the answer $w$ as the candidate word with which the offset is the closest to the representation vector of the synonym relation $r_s$, i.e.,

$$w = \operatorname*{argmin}_{i_{w_q}, i_{w_j};\, w_j \in T} \left|\, \left| v_{(w_j, i_{w_j})} - v_{(w_q, i_{w_q})} \right| - r_s \right|. \qquad (11)$$

In practice, we found the second solver performs better (the results are listed in Section 4).
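The two solvers can be contrasted with a short sketch. Here the inner bars of Equation (11) are read as an element-wise absolute value and the outer bars as a Euclidean norm; that reading, the function names, and the toy vectors are assumptions, since the text does not spell them out.

```python
import math
from typing import Dict, List, Sequence


def euclidean(u: Sequence[float], v: Sequence[float]) -> float:
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def synonym_by_distance(q: str, candidates: List[str],
                        senses: Dict[str, List[List[float]]]) -> str:
    """First solver: the candidate with the closest sense pair to the question word."""
    return min(candidates, key=lambda w: min(
        euclidean(vw, vq) for vw in senses[w] for vq in senses[q]))


def synonym_by_relation(q: str, candidates: List[str],
                        senses: Dict[str, List[List[float]]],
                        r_syn: Sequence[float]) -> str:
    """Second solver: the candidate whose offset from q best matches r_syn."""
    def score(w: str) -> float:
        return min(
            euclidean([abs(a - b) for a, b in zip(vw, vq)], r_syn)
            for vw in senses[w] for vq in senses[q])
    return min(candidates, key=score)


# Toy vectors: "b" is nearest to "q", but the offset of "a" equals the relation vector.
toy_senses = {"q": [[0.0, 0.0]], "a": [[1.0, 1.0]], "b": [[0.1, 0.1], [5.0, 5.0]]}
toy_relation = [1.0, 1.0]
first = synonym_by_distance("q", ["a", "b"], toy_senses)
second = synonym_by_relation("q", ["a", "b"], toy_senses, toy_relation)
```

The toy data is built so that the two solvers disagree, which shows how a learned relation vector can change the answer relative to plain nearest-neighbour search.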

Antonym 

Similar to solving the Synonym questions, we explored two solvers for Antonym questions as well. That is, the first solver (12) is based on the small offset distance between semantically close words, whereas the second solver (13) leverages the translation distance between two words' offset and the embedding vector of the antonym relation. One might doubt the reasonableness of the first solver given that we aim to find an answer word with the opposite meaning to the target word (i.e. an antonym). The explanation is that an antonym and its original word have similar co-occurrence information, from which the embedding vectors are derived, so the embedding vectors of two words in an antonym relation will still lie close to each other in the embedding space.

$$w = \operatorname*{argmin}_{i_{w_q}, i_{w_j};\, w_j \in T} d\left(v_{(w_j, i_{w_j})}, v_{(w_q, i_{w_q})}\right), \qquad (12)$$

$$w = \operatorname*{argmin}_{i_{w_q}, i_{w_j};\, w_j \in T} \left|\, \left| v_{(w_j, i_{w_j})} - v_{(w_q, i_{w_q})} \right| - r_a \right|, \qquad (13)$$

where $T$ is the set of candidate words and $r_a$ is the representation vector of the antonym relation. Again we found that the second solver performs better. Similarly, for skip-gram, only the first solver is applied.

4 Experiments 

We conduct experiments to examine whether our proposed 
framework can achieve satisfying results on verbal compre¬ 
hension questions. 

4.1 Data Collection 
Training Set for Word Embedding 

We trained word embeddings on a publicly available text corpus named wiki2014^2, which is a large text snapshot from Wikipedia. After being pre-processed by removing all the HTML metadata and replacing the digits with English words, the final training corpus contains more than 3.4 billion word tokens, and the number of unique words, i.e. the vocabulary size, is about 2 million.

IQ Test Set 

According to our study, there is no online dataset specifically released for verbal comprehension questions, although there are many online IQ tests for users to play with. In addition, most of the online tests only calculate the final IQ scores but do not provide the correct answers. Therefore, we only use the online questions to train the verbal question classifier described in Section 3.1. Specifically, we manually collected and labeled 30 verbal questions from the online IQ test website^3 for each of the five types (i.e. Analogy-I, Analogy-II, Classification, Synonym, and Antonym) and trained a one-vs-rest SVM classifier for each type. The total accuracy on the training set itself is 95.0%. The classifier was then applied to the test set below.

Table 2: Statistics of the verbal question test set.

| Type of Questions | Number of questions |
|---|---|
| Analogy-I | 50 |
| Analogy-II | 29 |
| Classification | 53 |
| Synonym | 51 |
| Antonym | 49 |
| Total | 232 |

We collected a set of verbal comprehension questions associated with correct answers from published IQ test books, such as [Carter, 2005; Carter, 2007; Pape, 1993; Ken Russell, 2002], and we used this collection as the test set to evaluate the effectiveness of our new framework. In total, this test set contains 232 questions with the corresponding answers. The statistics of each question type are listed in Table 2.


4.2 Compared Methods 

In the following experiments, we compare our new relation 
knowledge powered model to several baselines. 

Random Guess Model (RG). Random guess is the most straightforward way for an agent to solve questions. In our experiments, we used a random guess agent which would select an answer randomly regardless of what the question was. To measure the performance of random guess, we ran each task five times and calculated the average accuracy.

Human Performance (HP). Since IQ tests are designed to evaluate human intelligence, it is quite natural to leverage human performance as a baseline. To collect human answers on the test questions, we delivered them to human beings through Amazon Mechanical Turk, a crowd-sourcing Internet marketplace that allows people to participate in Human Intelligence Tasks. In our study, we published five Mechanical Turk jobs, one job corresponding to one specific question type. The jobs were delivered to 200 people. To control the quality of the collected results, we took several strategies: (i) we imposed high restrictions on the workers: we required all the workers to be native English speakers in North America and to be Mechanical Turk Masters (who have demonstrated high accuracy on previous Human Intelligence Tasks on the Mechanical Turk marketplace); (ii) we recruited a large number of workers in order to guarantee the statistical confidence in their performances; (iii) we tracked their age distribution and education background, which are very similar to those of the overall population in the U.S. While the design can still be improved, we believe the current results are already meaningful.

Latent Dirichlet Allocation Model (LDA). This baseline model leveraged one of the most classical distributional word representations, i.e. Latent Dirichlet Allocation (LDA) [Blei et al., 2003]. In particular, we trained word representations using LDA on wiki2014 with the topic number set to 1000.


^2 http://en.wikipedia.org/wiki/Wikipedia:Database_download
^3 http://wechsleradultintelligencescale.com/

Table 3: Accuracy (%) of different methods among different human groups.

| Method | Analogy-I | Analogy-II | Classification | Synonym | Antonym | Total |
|---|---|---|---|---|---|---|
| RG | 24.60 | 11.72 | 20.75 | 19.27 | 23.13 | 20.51 |
| LDA | 28.00 | 13.79 | 39.62 | 27.45 | 30.61 | 29.31 |
| HP | 45.87 | 34.37 | 47.23 | 50.38 | 53.30 | 46.23 |
| SG-1 | 38.00 | 24.14 | 37.74 | 45.10 | 40.82 | 38.36 |
| SG-2 | 38.00 | 20.69 | 39.62 | 47.06 | 44.90 | 39.66 |
| Glove | 45.09 | 24.14 | 32.08 | 47.06 | 40.82 | 39.03 |
| MS-1 | 36.36 | 19.05 | 41.30 | 50.00 | 36.59 | 38.67 |
| MS-2 | 40.00 | 20.69 | 41.51 | 49.02 | 40.82 | 40.09 |
| MS-3 | 17.65 | 20.69 | 47.17 | 47.06 | 30.61 | 36.73 |
| RK | 48.00 | 34.48 | 52.83 | 60.78 | 51.02 | 50.86 |

Skip-Gram Model (SG). In this baseline, we applied the word embedding trained by skip-gram [Mikolov et al., 2013] (denoted by SG-1). In particular, when using skip-gram to learn the embedding on wiki2014, we set the window size as 5, the embedding dimension as 500, the negative sampling count as 3, and the epoch number as 3. In addition, we also employed a pre-trained word embedding by Google^4 with the dimension of 300 (denoted by SG-2).

Glove. This baseline algorithm uses another powerful word embedding model, Glove [Pennington et al., 2014]. The configurations for running Glove are the same as those for running SG-1.

Multi-Sense Model (MS). In this baseline, we applied the multi-sense word embedding models proposed in [Huang et al.; Tian et al., 2014; Neelakantan et al., 2014] (denoted by MS-1, MS-2 and MS-3 respectively). For MS-1, we directly used the multi-sense word embedding vectors published by the authors^5, in which they set 10 senses for the top 5% most frequent words. For MS-2 and MS-3, we obtained the embedding vectors with the code released by the authors, using the same configurations as MS-1.

Relation Knowledge Powered Model (RK). This is our proposed method in Section 3. In particular, when learning the embedding on wiki2014, we set the window size as 5, the embedding dimension as 500, the negative sampling count as 3 (i.e. the number of randomly selected negative triples in $S'$), and the epoch number as 3. We adopted the online Longman Dictionary as the dictionary used in multi-sense clustering. We used a public relation knowledge set, WordRep [Gao et al., 2014], for relation training.


4.3 Experimental Results 
Accuracy of Question Classifier 

We applied the question classifier trained in Section 4.1 to the test set in Table 2 and obtained a total accuracy of 93.1%. For RG and HP, the question classifier was not needed. For other methods, the wrongly classified questions were also sent to the corresponding wrong solver to find an answer. If the solver returned an empty result (which was usually caused by invalid input format, e.g., an Analogy-II question was wrongly input to the Classification solver), we would randomly select an answer.
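The routing and fallback behaviour described above can be sketched as follows. The function name `answer_question`, the solver interface (a callable that returns an answer or `None`), and the toy solvers are assumptions for illustration, not the evaluation code used in the experiments.

```python
import random
from typing import Callable, Dict, List, Optional


def answer_question(question: str,
                    predicted_type: str,
                    candidates: List[str],
                    solvers: Dict[str, Callable[[str, List[str]], Optional[str]]],
                    rng: random.Random) -> str:
    """Send the question to the solver of its predicted type.

    If that solver yields nothing (for example because a misclassified
    question has the wrong input format), fall back to a random candidate.
    """
    solver = solvers.get(predicted_type)
    result = solver(question, candidates) if solver is not None else None
    if not result:
        return rng.choice(candidates)
    return result


# Toy solvers: one always answers with the first candidate, one always fails.
toy_solvers = {
    "Synonym": lambda q, cands: cands[0],
    "Classification": lambda q, cands: None,
}
rng = random.Random(0)
options = ["calm", "quiet", "relaxed"]
routed = answer_question("Which word is closest to CALM?", "Synonym", options, toy_solvers, rng)
fallback = answer_question("Which word is closest to CALM?", "Classification", options, toy_solvers, rng)
```

Under this scheme a classification error does not remove a question from the evaluation; it only lowers the chance that the question is answered correctly.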

Overall Accuracy 

Table 3 demonstrates the accuracy of answering verbal questions by using all the approaches mentioned in Section 4.2. From this table, we have the following observations: (i) RK achieves the best overall accuracy among all the methods. In particular, RK raises the overall accuracy by about 4.63 percentage points over HP (50.86% versus 46.23%). (ii) RK is empirically superior to the skip-gram models SG-1/SG-2 and Glove. According to our understanding, the improvement of RK over SG-1/SG-2/Glove comes from two aspects: multi-sense and relational knowledge. Note that the performance difference between MS-1/MS-2/MS-3 and SG-1/SG-2/Glove is not significant, showing that simply changing single-sense word embedding to multi-sense word embedding does not bring much benefit. One reason is that the rare word-senses do not have enough training data (contextual information) to produce high-quality word embedding. By further introducing the relational knowledge among word-senses, the training for rare word-senses will be linked to the training of their related word-senses. As a result, the embedding quality of the rare word-senses will be improved. (iii) RK is empirically superior to the three multi-sense algorithms MS-1, MS-2 and MS-3, demonstrating the effectiveness brought by adopting fewer model parameters and using an online dictionary in building the multi-sense embedding model.

^4 https://code.google.com/p/word2vec/
^5 http://ai.stanford.edu/~ehhuang/

These results are quite impressive, indicating the potential of using machines to comprehend human knowledge and even achieve a level comparable to human intelligence.

Accuracy in Different Question Types 

Table 3 reports the accuracy of answering various types of verbal questions by each compared method. From the table, we can observe that the SG and MS models can achieve competitive accuracy on certain question types (like Synonym) compared with HP. After incorporating knowledge into learning word embedding, our RK model can improve the accuracy over all question types. Moreover, the table shows that RK yields a big improvement over HP on the question types of Synonym and Classification, while its accuracy on the other question types is not as good as on these two types.

To sum up, the experimental results have demonstrated the effectiveness of the proposed RK model compared with several baseline methods. Although the test set is not large, the generalization of RK to other test sets should not be a concern due to the unsupervised nature of our model.

5 Conclusions 

We investigated how to automatically solve verbal comprehension questions in IQ tests by using the word embedding techniques in deep learning. In particular, we proposed a three-step framework: (i) to recognize the specific type of a verbal comprehension question by a classifier, (ii) to leverage a novel deep learning model to co-learn the representations of both word-sense pairs and relations among words (or their senses), (iii) to design dedicated solvers, based on the obtained word-sense pair representations and relation representations, for addressing each type of question. Experimental results have illustrated that this novel framework can achieve better performance than existing methods for solving verbal comprehension questions and even exceed the average performance of the Amazon Mechanical Turk workers involved in the experiments.